Expert Systems with Applications — Latest Matching Preprints

1

How can AI be compatible with evidence-based medicine?: with an example of analysis of lung cancer recurrence

Usuzaki, T.; Matsunbo, E.; Inamori, R.

2026-04-25 radiology and imaging 10.64898/2026.04.17.26351114 medRxiv

Top 0.1%

11.8%

Despite the remarkable progress of artificial intelligence, represented by large language models, how AI technologies can contribute to building evidence for evidence-based medicine (EBM) remains an overlooked issue. An AI that is compatible with EBM is now needed. In this paper, we propose an example analysis that may contribute to this approach, using variable Vision Transformer.

2

CerViX-Net: A Multi-Branch Fusion of Vision Transformer and Convolutional Neural Networks for Cervical Cancer Detection using Cytology Images

De, S.

2026-06-24 radiology and imaging 10.64898/2026.06.24.26356425 medRxiv

Top 0.1%

3.9%

Cervical cancer represents a pressing global health challenge, emphasizing the critical need for accurate and timely diagnostic methods to facilitate effective treatment and improve survival rates. In response to this challenge, the study presents CerViX-Net, an innovative classification framework designed to advance cervical cancer detection through enhanced computational efficiency and diagnostic accuracy. The development of CerViX-Net is motivated by the limitations of traditional diagnostic models, particularly in handling the computational and memory demands of large-scale data, while ensuring precise feature extraction and classification. CerViX-Net employs a hybrid deep learning architecture that combines the capabilities of ResNet50, EfficientNet-B0, and a Modified Vision Transformer (ViT) module. The ResNet50 branch extracts hierarchical features through stacked convolutional and identity blocks. In another path, the modified ViT module transforms image patches via linear projection, augments them with positional and class embeddings, and processes them using Parallel Transformer Encoder layers to model contextual relationships. Concurrently, EfficientNet-B0 utilizes MBConv blocks to extract multi-scale representations. The feature outputs from all three branches are integrated and passed through a classification head consisting of dropout layers and dense layers to ensure robust and accurate predictions. The proposed framework is rigorously evaluated on the Mendeley LBC dataset, achieving exceptional performance metrics with an accuracy of 99.69%, precision of 99.28%, recall of 99.48%, and an F1-score of 99.52%. The robustness of CerViX-Net is further validated on the SIPaKMeD and Herlev Pap Smear datasets, where it demonstrates comparable excellence, underscoring its efficacy and adaptability across diverse cytology datasets. Statistical validation using Friedman's test further reinforces its superiority over competing methods.

3

Acceptability and Perceptions of Artificial Intelligence in Organized Breast Cancer Screening: A Study of French Women

Jean, A.; Merceron, A.; Le Saux, A.; Mercier, E.; Benillouche, P.

2026-06-09 radiology and imaging 10.64898/2026.06.07.26354883 medRxiv

Top 0.1%

3.4%

This study aims to assess women's perceptions of artificial intelligence (AI) used in breast cancer screening in France by examining their knowledge of AI and the barriers to their participation in organized screening. The results of a survey conducted in June 2025 among a national sample of 2000 women (aged 40-75) reveal limited participation and persistent concerns among women. Nevertheless, despite a low awareness of specific AI applications, a large majority of the women surveyed are very favorable to the use of AI in breast cancer diagnosis, even considering it a lever to increase screening participation.

4

A Digital Twin for Tracking and Forecasting Glycemia with Septic Patients in ICUs

Cao, X.; Wei, X.; Hou, J.; Cai, C.; Wang, Q.

2026-05-04 endocrinology 10.64898/2026.04.24.26351177 medRxiv

Top 0.1%

2.0%

We present a digital twin framework for real-time glucose monitoring and forecasting in septic patients in intensive care units (ICUs). The framework combines advanced machine learning models trained on continuous glucose measurements with a dynamic transfer-learning workflow that enables rapid deployment to individual patients and supports personalized, adaptive, and predictive clinical decision-making. Built on a foundation model (a pretrained time-series transformer), the digital twin continuously updates its parameters as new patient data arrive and produces rolling near-term forecasts in real time. To assess adaptability and computational efficiency, we deployed the pretrained model to ten septic patients and evaluated multiple retraining strategies, including zero-shot inference, linear probing, and full and staged fine-tuning. Results show that the model can be initialized and personalized for a new patient within seconds on a standard laptop while achieving accurate glucose forecasts under varying data conditions. These findings demonstrate the feasibility of real-time model personalization in resource-constrained, high-acuity environments and highlight the potential of digital twins as scalable, AI-enabled platforms for continuous physiological monitoring, clinical decision support, and individualized treatment design in the ICU.

5

Graph Neural Networks (GNNs) for Protein-Ligand Interaction Prediction

Khilar, S.; Natarajan, E.

2026-04-24 bioinformatics 10.64898/2026.04.23.720519 medRxiv

Top 0.1%

1.9%

Protein-ligand interaction prediction in modern drug discovery has evolved through the combination of artificial intelligence and structural bioinformatics using Graph Neural Networks (GNNs). GNN models have achieved a high degree of accuracy in determining binding affinity and identifying active compounds, as evidenced by [1] [2] [3] [4], but their limited explainability remains an important obstacle in biomedical research. This work focuses on the interpretation of protein-ligand interactions at the molecular level, a rapidly developing area within GNN research. Recent studies use visualization techniques, attention mechanisms, and model-based feature attribution to make models more robust and to reduce false binding predictions. In addition, approaches such as graph pooling strategies, message-passing optimization, self-supervised learning, transfer learning, and contrastive learning are increasingly used to enhance representation learning. Furthermore, integrating molecular docking simulations, hybrid deep learning architectures, and protein language models gives more reliable and biologically meaningful predictions of protein-ligand interactions. These methods identify key ligand atoms and binding residues, as well as the physicochemical factors influencing affinity, in a way that follows chemical reasoning. This work identifies the challenges of developing biologically significant explanations, of transparency, and of the effect of dataset biases on interpretability. It also investigates in depth the integration of protein language models to establish more reliable pathways for future research, examining hybrid architectures, transparent and energy-efficient GNNs, and scientifically grounded AI models for drug discovery. The work highlights that XGNNs connect deep learning with biochemical expertise with increased confidence, which will enhance the accuracy of predictive and computational models.

6

An Interpretable Multimodal Framework for Student Mental Health Risk Assessment Using Temporal Embeddings and Fuzzy Inference

Shah, A.; Mehta, A.; Bhensdadia, C. K.

2026-05-20 health informatics 10.64898/2026.05.16.26352630 medRxiv

Top 0.2%

1.7%

Mental health challenges among university students have increased due to academic pressure, lifestyle changes, and continuous digital engagement. Existing approaches for mental health assessment often rely either on self-reported psychological scales or isolated behavioral indicators, limiting their ability to capture complex temporal and contextual patterns. This study proposes an interpretable multimodal framework for student mental health risk assessment using behavioral sensing, academic information, ecological momentary assessments (EMA), and psychometric survey data. A bidirectional Long Short-Term Memory autoencoder is employed to learn latent temporal representations from day-level behavioral sequences, while graph embeddings capture structural relationships among students using similarity-based neighborhood graphs. These representations are fused with academic and survey-derived features and reduced using Principal Component Analysis and Uniform Manifold Approximation and Projection. K-means clustering is then applied to identify behaviorally distinct student groups. Experimental analysis on the StudentLife dataset demonstrates meaningful clustering performance with a Silhouette Score of 0.4209 and Adjusted Rand Index stability of 0.6869. The identified clusters correspond to low-risk, moderate-risk, and high-risk behavioral profiles. To improve interpretability and practical usability, a fuzzy inference system is introduced to compute mental risk, academic risk, and wellbeing indices using psychometric indicators including PHQ-9, PSS, PANAS, VR-12, and Big Five personality traits. The results demonstrate the potential of combining multimodal behavioral modeling with interpretable fuzzy reasoning to support early mental health risk assessment in educational settings.

7

MedZone Embedder: a framework for representation learning of Japanese secondary medical care areas from a national ICU registry, characterizing intensive care provision structure and regional vulnerability

Ohno, K.; Hashimoto, S.

2026-07-20 health informatics 10.64898/2026.07.17.26358373 medRxiv

Top 0.2%

1.5%

Background: In Japan, acute inpatient care is divided into approximately 335 secondary medical care areas, which serve as the basic units for planning healthcare delivery systems under the 8th National Health Care Plan. While comparisons between regions and facilities typically rely on a single risk-adjusted metric, this approach confuses differences in patient demographics with differences in the actual infrastructure of intensive care units (ICUs). This paper presents a framework - MedZone Embedder - for deriving data-driven indicators of regional structural vulnerability by mapping secondary medical care areas onto a learned similarity space, together with its working implementation. The paper sets out the concept, the method, a proof of concept, and an explicit staged validation program, rather than national empirical results. Methods: Each area is represented by a feature vector consisting of aggregated values of intensive care provision indicators derived directly from the Japan Intensive Care Patient Database (JIPAD) - specifically, risk-adjusted mortality rates (standardized mortality ratios and an in-hospital composite indicator), technical efficiency, length of stay, readmission rates, case severity, and case composition - with the within-area variance of these indicators also taken into account. No hierarchical processing by facility type is performed. A contrastive autoencoder (multilayer perceptron encoder 32 -> 16 -> 8, symmetric decoder) is trained by self-supervised learning, using an objective function that combines reconstruction and normalized temperature cross-entropy (NT-Xent) on noise-augmented views. The resulting 8-dimensional embedding supports area searches based on cosine similarity and anomaly scoring in the embedding space (using isolation forest, Mahalanobis distance, or k-nearest-neighbor density), which is normalized to a vulnerability score ranging from 0 to 1. 
If deep learning libraries are unavailable, or if the number of areas is small, an alternative method using deterministic principal component analysis is employed. Results: This method was implemented and deployed within an operational ICU decision support system on a managed cloud platform. The proof of concept (PoC) is structured around five secondary medical care areas within Kyoto Prefecture and runs entirely on synthetic facility-level aggregate data constructed to follow the JIPAD indicator schema; no registry data were accessed. It generated: an aggregate provision profile for each area; an area embedding space equipped with a similar-area search function; and a vulnerability ranking that identifies areas with low patient numbers and low diversity that exhibit overall poor outcomes. At this scale, the contrastive autoencoder falls back to principal component projection. The deep learning pathway has been implemented and unit testing has been completed; training and evaluation on actual registry data are pending data-use approval and the expansion of data integration. Validation is staged: Stage 2 will train the contrastive pathway over JIPAD-covered areas to assess construct validity against public structural indicators (ICU/HCU beds, population, accessibility), and Stage 3 will extend coverage to all areas via National Database (NDB) linkage. Conclusion: MedZone Embedder reframes regional comparison from single-indicator ranking to structural representation: which areas are alike, and which are structural outliers. The contribution of this paper is the framework - the proposal that the intensive care provision structure of Japanese secondary medical care areas can be learned from a national outcomes registry and read through the lens of what we call institutional debt - together with a deployed implementation and a pre-specified validation program. 
To our knowledge, this is a candidate first application of contrastive representation learning to Japanese secondary medical care areas.
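
The NT-Xent term named in the Methods can be sketched as follows. This is a minimal pure-Python illustration and not the authors' implementation: the function names, the temperature of 0.5, and the toy 2-dimensional embeddings are assumptions made for the example, and the reconstruction term of the combined objective is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(view_a, view_b, temperature=0.5):
    """Mean NT-Xent loss; view_a[i] and view_b[i] are two augmented views of area i."""
    z = view_a + view_b          # 2N embeddings in one batch
    n = len(view_a)
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)    # index of the positive partner of i
        # Similarities to every other embedding (positive and negatives).
        logits = [cosine(z[i], z[k]) / temperature
                  for k in range(2 * n) if k != i]
        positive = cosine(z[i], z[j]) / temperature
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - positive
    return total / (2 * n)


# Hypothetical toy embeddings: two areas, two noisy views each.
areas = [[1.0, 0.0], [0.0, 1.0]]
aligned_views = [[0.9, 0.1], [0.1, 0.9]]
swapped_views = [[0.1, 0.9], [0.9, 0.1]]
print(nt_xent(areas, aligned_views), nt_xent(areas, swapped_views))
```

The loss is lower when each view sits close to its own partner than when the partners are swapped, which is the behavior a contrastive encoder is trained toward.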

8

Multi-Agent AI for Chest Radiography: A Sequential Segmentation and LLM-Driven Consultative Tool for Medical Training

Kurt, F.; Subasi, A.

2026-06-01 health informatics 10.64898/2026.05.29.26354432 medRxiv

Top 0.2%

1.4%

Background: Traditional diagnostic models lack explainability, while multimodal language models prone to hallucination remain unsafe for medical education. An interactive, risk-free artificial intelligence framework is required to serve as a reliable clinical mentor for radiology trainees. Methods: We propose a multi-agent architecture decoupling deterministic image analysis from generative consultation. Specialized computer vision models perform anatomical localization and pathological segmentation. These quantitative outputs are synthesized into a structured payload, which grounds a locally hosted large language model (LLaVA 7B) using strict prompt guardrails and prerequisite protocols. Results: The system effectively eliminates visual hallucinations by intercepting unanchored queries. The artificial intelligence tutor successfully contextualizes spatial anomalies and baseline metrics, generating accurate conversational explanations and formally structured radiology reports while strictly enforcing medical safety disclaimers. Discussion and Conclusion: By anchoring language generation exclusively to verified algorithmic realities, this framework transforms opaque diagnostic models into safe, interactive educational simulators. This establishes a highly reliable paradigm for integrating explainable artificial intelligence into medical training.

9

Precision Physical Activity Prescription via Reinforcement Learning for Functional Actions

Lin, G.; Miao, R.; Sacheck, J.; Zhang, X.

2026-05-21 public and global health 10.64898/2026.05.18.26353525 medRxiv

Top 0.2%

1.3%

Physical activity (PA) plays an important role in maintaining and improving health. Daily steps have been a key PA measure that is easily accessible with common wearable devices. However, methods are lacking to recommend a personalized optimal distribution of daily steps over a period of time for the best of certain health biomarkers. In this paper, we fill this void based on the data from the All of Us Research Program which includes months of step counts as well as repeated measurements of key health biomarkers. We develop a new offline reinforcement learning (RL) algorithm to learn personalized and optimal PA distributions associated with cardiometabolic risk, where the action is a function representing the daily step distribution over a period of time. Simulation studies demonstrate the advantage of the proposed approach over existing continuous-action RL methods. The learned optimal policy from the All of Us data generally suggests people take more daily steps and also follow a more consistent pattern of PA over time while offering tailored recommendations for subgroups in blood glucose level, body mass index, blood pressure, age, and sex.

10

Modeling the Effectiveness of Antibiotic Therapies Against Sepsis Using Continuous-time Hidden Markov Models

Schmiegel, S.; Marchi, H.; Borgstedt, R.; Rehberg, S.; Fuchs, C.; Mews, S.

2026-07-10 health informatics 10.64898/2026.07.03.26357092 medRxiv

Top 0.2%

1.1%

Patients suffering from sepsis need to be treated with an effective antibiotic therapy within the first hour after sepsis onset to decrease their risk of death. Microbiological data that provide information about the suitability of antibiotic therapies, however, is usually available only after 72 hours. Consequently, the treating physicians need to judge a therapy's effectiveness based on the patients' measured health records and their general health condition. This medical assessment is complex and requires years of experience. In our study, we investigate how statistical modeling can contribute to assessing the effectiveness of antibiotic therapies. To that purpose, we describe the effectiveness of antibiotic therapies by modeling sepsis patients' health conditions using a three-state continuous-time hidden Markov model (ctHMM). In literature, procalcitonin (PCT) and lactate have proven to be helpful for deriving the health condition in this context. The state probabilities obtained by the ctHMM are subsequently used to quantify the effectiveness of antibiotic therapies. To this end, we apply two different approaches, namely (i) averaging of the state probabilities and (ii) a logistic regression model. For (i), we calculate the average of the state probabilities for the state indicating a sepsis-free condition over an antibiotic administration period of 48 hours. For (ii), we use the information about antibiotic susceptibility testings as dependent variable in the logistic regression model; as independent variables, we calculate the difference between state probabilities at the start of antibiotic administration and 48 hours later. With this work, we are able to better understand the relationship between laboratory values, in particular PCT and lactate, and the patients' health condition. We further provide approaches for quantifying the effectiveness. 
Therefore, our work contributes to developing a clinical decision support system that helps physicians assess the effectiveness of antibiotic therapies in patients with sepsis. Supported by such a system, a physician can quickly adjust an ineffective therapy, which helps avoid antibiotic resistance and increases the patient's chance of surviving sepsis.
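
In a time-homogeneous continuous-time Markov chain, the transition probabilities over an interval t are P(t) = exp(Qt), where Q is the generator (rate) matrix. The sketch below computes P(48 h) for a three-state chain with a scaled Taylor series. It only illustrates the hidden-state dynamics under stated assumptions: the state labels and every rate value are hypothetical (the abstract reports none), and the emission model for PCT and lactate is left out.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(generator, hours, terms=20, squarings=8):
    """Approximate exp(generator * hours) by scaling and squaring."""
    n = len(generator)
    scale = 2 ** squarings
    scaled = [[q * hours / scale for q in row] for row in generator]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms + 1):
        # term holds scaled**k / k! after this step.
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[r + t for r, t in zip(r_row, t_row)]
                  for r_row, t_row in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)   # undo the scaling
    return result


# Hypothetical per-hour rates; states: 0 = septic, 1 = recovering, 2 = sepsis-free.
Q = [
    [-0.050, 0.040, 0.010],
    [0.020, -0.060, 0.040],
    [0.005, 0.015, -0.020],
]
P48 = transition_matrix(Q, 48)
for row in P48:
    print([round(p, 3) for p in row])
```

Each row of the result is a probability distribution over the three states 48 hours later, given the starting state for that row.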

11

Automatic Bevacizumab Response Prediction in Ovarian Cancer from Digital Pathology Images via Novel AI-based Computational Pipeline

Alsaiari, A.; Turki, T.; Taguchi, Y.-h.

2026-05-04 bioinformatics 10.64898/2026.04.29.721782 medRxiv

Top 0.2%

1.1%

Ovarian cancer is a gynecological cancer that can be fatal if it metastasizes and is not detected early. There is therefore a need to accurately predict drug response in ovarian cancer. A gynecological pathologist inspects tissue for abnormalities and then reports on the patient; however, this diagnostic process is (1) hard, (2) dependent on experience, and (3) time-consuming. Moreover, existing tools are far from perfect. Hence, we present a computational pipeline to improve drug response prediction in ovarian cancer, derived as follows. First, we download digital pathology images pertaining to ovarian bevacizumab response from The Cancer Imaging Archive repository. We apply histogram of oriented gradients to the images to construct feature vectors, which are passed to Fisher linear discriminant analysis for dimensionality reduction. Then, we feed the reduced-dimensionality data to support vector regression with various kernels and calculate the area under the ROC curve (AUC). Experimental results against transformer-based models (ViT and Swin) and other deep learning (DL) models (VGG16, ResNet50, InceptionV3, MobileNetV2, and EfficientNetB6) demonstrate that our approach with the radial kernel (named SVRD+R) yielded an AUC improvement of 17% over the best-performing transformer-based model (ViT) and of 14.9% over the best DL-based model (MobileNetV2). These results demonstrate the superiority and feasibility of our AI-based pipeline for prediction problems in gynecologic cancer studies. MSC: 92B05; 68T09

12

A Consensus-Driven Stacking Ensemble Framework for Interpretable Cardiovascular Risk Prediction and Clinical Deployment

Sozol, S. S.; Dev Nath, B. C.; Fahim, F. M. S.; Suzana, N. N.; Mirza, J. F.; Ahmmed, S.; Zohra, F.-T.; Zafr, A. H. A.; Uddin, M. N.; Mondal, M. R. H.; Hoque, A. S. M. L.

2026-05-26 health informatics 10.64898/2026.05.18.26352989 medRxiv

Top 0.3%

1.0%

Machine learning (ML) is being considered to help diagnose cardiovascular diseases (CVD). Still, challenges like inconsistent and limited datasets, limited infrastructure, and global inequalities create the need for a reliable and practicable ML solution. This paper presents an ML-driven framework for predicting CVD risk scores and classifying status. Several data preprocessing techniques are applied, including multiple imputation by chained equations (MICE) and outlier removal. In addition, hyperparameter tuning is performed with GridSearchCV. Moreover, a consensus-driven five-feature selection method is applied to identify optimal predictors. The dataset used in this study contains healthcare records related to future CVD risk scores, comprising 1,529 patient records with 22 features. The optimized stacked ensemble model is applied to the dataset and achieves a cross-validated coefficient of determination of 98.13% for CVD risk score regression. Comparative evaluation with other ML models confirmed improved accuracy, efficiency, and interpretability. The explainable AI technique SHAP is applied to interpret predictions and highlight key risk factors. Moreover, a deployment-ready web platform with multi-role access has been developed to demonstrate clinical applicability. The proposed framework offers a reliable and interpretable tool for early detection of CVD and personalized risk assessment. In the future, this work can be extended to integrate longitudinal data, medical imaging, and deep learning to improve generalizability and strengthen real-world impact.

13

Algorithmic implementation of pancreatic cancer staging guidelines: comparison with a retrieval-augmented large language model

Komaba, A.; Amakawa, A.; Tozuka, R.; Sato, J.; Fujihara, K.; Emoto, M.; Sawada, S.; Kasai, S.; Sakamoto, K.; Shimura, K.; Johno, Y.; Nakamoto, K.; Ichikawa, S.; Johno, H.

2026-07-02 radiology and imaging 10.64898/2026.06.30.26356912 medRxiv

Top 0.4%

0.9%

Purpose: To implement a comprehensive knowledge-based algorithm (KBA) for pancreatic cancer staging based on the current Japanese guidelines and to evaluate its performance as a clinical decision support system in comparison with a retrieval-augmented large language model (LLM) system. Materials and methods: A KBA covering TNM classification, stage classification, and resectability classification was implemented as a web application. The correctness of the system outputs was exhaustively verified for all possible inputs. Subsequently, six non-board-certified radiologists performed pancreatic cancer staging for 12 simulated cases with imaging findings under three conditions: unassisted, LLM-assisted, and KBA-assisted. Staging accuracy and staging time were compared among the three conditions using pairwise proportion z-tests and Welch's t-tests, respectively. Results: In the comparative experiment, staging accuracy was 81.9%, 80.6%, and 98.6% in the unassisted, LLM-assisted, and KBA-assisted conditions, respectively. Mean staging time was 229.2, 401.9, and 196.2 s, respectively. The KBA-assisted condition showed higher accuracy than both the unassisted and LLM-assisted conditions (both p<0.001). Staging time was longer in the LLM-assisted condition than in the other two conditions (both p<0.001). Conclusion: A comprehensive KBA for pancreatic cancer staging based on the current Japanese guidelines was implemented and exhaustively verified. In a preliminary comparative experiment, KBA assistance improved staging accuracy without increasing staging time, whereas LLM assistance increased staging time without improving staging accuracy. These findings suggest that verified KBA systems may be feasible and useful for clinical tasks governed by explicit guideline-based rules.
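
The pairwise proportion z-test mentioned in the methods can be sketched as below. Two things here are assumptions rather than statements from the abstract: that each condition comprises 6 readers x 12 cases = 72 stagings, with correct counts of 59, 58, and 71 back-calculated from the reported 81.9%, 80.6%, and 98.6%; and that a pooled-variance, two-sided test was used. The function name is hypothetical.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Pooled two-sided z-test for the difference between two proportions."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Assumed counts (see the note above): correct stagings out of 72 per condition.
kba_vs_unassisted = two_proportion_z_test(71, 72, 59, 72)
kba_vs_llm = two_proportion_z_test(71, 72, 58, 72)
print(kba_vs_unassisted, kba_vs_llm)
```

Under these assumed counts, both comparisons give p-values below 0.001, in line with the reported significance.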

14

An Efficient and Interpretable Learning Approach for Large-Scale Histopathology Data

Moore, C.; Gupta, V.; Neupane, S.; Tripathi, H.

2026-05-03 health informatics 10.64898/2026.04.30.26352196 medRxiv

Top 0.4%

0.9%

Prostate cancer (PCa) remains one of the leading causes of cancer-related mortality among men, and histopathological analysis of prostate biopsy specimens is central to diagnosis and risk stratification. Whole-slide Images (WSIs) capture rich morphological information, but their gigapixel scale and the large number of extracted tissue patches make exhaustive annotation and model training computationally expensive. Attention-based Multiple Instance Learning (MIL) has emerged as an effective weakly supervised framework for WSI analysis, enabling slide-level prediction without requiring patch-level annotations. However, training MIL models on large histopathology cohorts remains resource intensive because many extracted patches are non-informative, and some patches are often processed repeatedly during training. To address these challenges, we propose an efficient and interpretable learning framework for large-scale histopathology analysis. Our method combines a pathology-pretrained UNI encoder, a Clustering-constrained Attention Multiple instance learning-Single Branch (CLAM-SB) attention-based MIL model, and a window-based training strategy that reduces computational overhead while preserving predictive performance. The paper illustrates our proposed approach and experiments on TCGA-PRAD WSIs for the PCa patients. Processing 189,600 sampled patches across 79 WSIs with our proposed approach reduced total training time by 57.5% (20 to 8.5 hours for 5 epochs) and 41.4% (27 to 16 hours for 10 epochs), respectively, underscoring its potential as a practical and resource-efficient strategy for scalable prostate histopathology analysis.

15

Benchmarking Speech Recognition Models for Medical Consultations in Latin American Spanish: A Comparative Evaluation with Fine-Tuning

Carrillo, R. M.; Carbajal Serrano, A.; Condori Pinedo, P. S.

2026-07-16 public and global health 10.64898/2026.07.14.26358062 medRxiv

Top 0.4%

0.9%

BACKGROUND: Artificial intelligence (AI) medical scribes rely on speech-to-text (STT) models for transcription. Evaluations of STT models in non-English settings remain scarce. We benchmarked ten STT models on medical consultations in Latin American (LatAm) Spanish and assessed whether fine-tuning improves transcription accuracy. METHODS: We used ten YouTube videos depicting medical consultations, with human transcriptions as the ground truth. Five open-source models were evaluated (Whisper Large, Whisper Large v3, Whisper Large v3 Turbo, Voxtral Mini 3B, and Canary 1B v2), along with five closed-source models (gpt-4o-transcribe, gpt-4o-mini-transcribe, gemini-2.5-pro, Eleven Labs, and Assembly AI). Whisper Large v3 was fine-tuned, with one video withheld from training. Performance was assessed on the withheld video using Word Error Rate (WER), Character Error Rate (CER), BLEU Score, ROUGE-L, BERT Score, and Semantic Similarity. RESULTS: None of the fine-tuning iterations outperformed the vanilla Whisper Large v3. On the withheld video, Gemini-2.5-pro was the closed-source model with the best performance in four of six metrics. In comparison to the closed-source models, the fine-tuned model never outperformed the other models (withheld video); conversely, in comparison to the closed-source models, the fine-tuned model showed better performance across metrics, for instance BLEU score (63% vs 58% for the second-ranking model), BERT Score (89% vs 86%), semantic similarity (89% vs 83%), and CER (19% vs 20%). CONCLUSIONS: Whisper Large v3 and its fine-tuned variant are the best open-source STT models for transcribing medical conversations in LatAm Spanish. These findings provide an evidence base for developing AI medical scribes tailored to Spanish-speaking LatAm.
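
Word Error Rate, the first metric listed above, is the word-level edit distance between the human reference and the model transcript, divided by the number of reference words. A minimal sketch follows; the function name and the example sentences are hypothetical, and real evaluations usually normalize case and punctuation first, which is omitted here.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, kept one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Hypothetical example: one dropped word out of five reference words.
print(word_error_rate("el paciente tiene fiebre alta", "el paciente tiene fiebre"))
```

Character Error Rate follows the same recurrence with characters in place of words.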

16

Protocol-Guided Cross-Domain Transfer Learning for Bovine Facial Pain Recognition under Weak Dairy-Farm Labels

Patel, S.; Neethirajan, S.

2026-06-23 animal behavior and cognition 10.64898/2026.06.18.733162 medRxiv

Top 0.4%

0.9%

Livestock welfare models are developed under controlled experimental conditions but deployed across farms, breeds, management systems and label regimes, where reliability remains uncertain. We introduce the Protocol-Driven Transfer Evaluation (PDTE) framework, which treats the adaptation protocol, comprising label mapping, objective design, domain alignment, model selection, calibration and threshold policy, as the experimental variable and evaluates transfer through animal-level external validation with uncertainty quantification. We apply PDTE to a bovine welfare task involving transfer of a facial pain representation from postoperative beef cattle to dairy cows under shifts in breed, sex, production system, clinical etiology, recording environment and label fidelity. Using an author-collected Canadian Holstein and Jersey dataset with an independent eight-cow test cohort, direct source-domain transfer was weak, with sequence AUC 0.418 and cow-level AUC 0.400. PDTE identified two failure modes under weak supervision: threshold collapse, in which adaptation converges to a single prediction class, and calibration-induced collapse, in which score ranking is preserved while decision behavior deteriorates. Across protocols, objective design dominated performance. Class-balanced focal adaptation achieved stable operating behavior (sequence AUC 0.611; cow-level AUC 0.667), while a target-only model attained comparable performance without source initialization (sequence AUC 0.596; paired p = 0.984), indicating that protocol design and operating-point choices contributed more than pretraining under weak-label conditions. Animal-level uncertainty remained substantial, with a bootstrap 95% confidence interval of 0.20 to 1.00, exceeding the transfer effect. These findings show that transferability limits cannot be inferred from source-domain performance alone and require protocol-controlled, uncertainty-aware evaluation in livestock AI.

17

Combined values alignment and epistemic verification prevent delusional reinforcement in conversational AI agents

Carrano, A.; Patel, M. S.; Hartono, S.; Ekker, S. C.

2026-06-02 health informatics 10.64898/2026.05.29.26354389 medRxiv

Top 0.4%

0.8%

Conversational AI is being deployed into medical decision support, mental-health triage, and social companionship, where reinforcement of a user's false or delusional belief can cause direct harm. Most deployed safety techniques are evaluated for factual accuracy in isolation; the question of whether they protect against belief-level harm, and whether layered architectures behave additively or synergistically, has not been answered empirically. We compared four configurations of the same underlying model: a bare language model (condition A); an explicit values constraint we call the First Law architecture (condition B); a real-time epistemic verification layer called Aletheia (condition C); and the complete architecture combining all components together (condition D). Across 156 scored responses spanning 39 probe items in four belief-harm domains, condition A passed only 3 of 36 main-battery probes (8.3%; 95% CI 1.8 to 22.5%) under triple-blind human consensus rating, demonstrating the core limitations of unmodified LLM deployments. In contrast, the three safety architectures (B-D) passed at least 97% of items (Fisher's exact, P < 0.001 versus A). On a synergy battery designed to test items at the intersection of value- and epistemic-domain failures (16 scored items, AI-rated), only the complete architecture passed every item; single-layer conditions failed on 7 of 16 items (43.8%) where neither values constraint nor verification was individually sufficient. Linear mixed-effects modelling of three-turn emotional escalation gave a slope of -1.00 points per turn for the values-only condition (t = -6.20) and -0.75 points per turn for the verification-only condition (t = -4.65); the complete architecture was flat at β = 0.00.
We describe a mechanistic failure of single-layer verification we call bot-validates-kernel-endorses-inference, in which accurate confirmation of a true factual element embedded in a delusional claim transfers epistemic authority to the surrounding false inference. Values alignment and factual verification address different failure modes, and the combined VaaS-Aletheia architecture is what produces stable protection across emotional escalation in conversational settings. The complete architecture evaluated here represents an evidence-based specification for safer deployment of AI in high-stakes advisory contexts and serves as a benchmark against which future safety architectures can be compared.
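
The interval quoted for condition A (3 of 36 passes) can be checked with a short calculation. The abstract does not say which interval method was used; the sketch below assumes an exact (Clopper-Pearson) binomial interval, solved by bisection with the standard library only. The function names are illustrative.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def solve_decreasing(f, target, lo=0.0, hi=1.0, iters=100):
    """Bisection for p in [lo, hi] with f(p) = target, where f decreases in p."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid  # CDF still too large, so p must increase
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided binomial confidence interval for k successes in n trials."""
    # Lower bound: P(X >= k | p) = alpha/2, i.e. P(X <= k-1 | p) = 1 - alpha/2.
    lower = 0.0 if k == 0 else solve_decreasing(
        lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper bound: P(X <= k | p) = alpha/2.
    upper = 1.0 if k == n else solve_decreasing(
        lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper

low, high = clopper_pearson(3, 36)
print(f"3/36 = {3 / 36:.1%}, 95% CI {low:.1%} to {high:.1%}")
```

Under this assumption the bounds come out close to the reported 1.8 to 22.5%, which suggests, but does not confirm, that an exact interval was used.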

18

A Hybrid Framework for Accurate Melanoma Diagnosis: Leveraging Generative AI with Enhanced CNN+ Architectures

Wu, Y.; Zhang, B.; Yan, Y.; Li, J.; Wu, Y.; Kim, S. S.; Huang, K.; Ye, Q.; Yu, Y.; Tong, G.

2026-04-28 dermatology 10.64898/2026.04.27.26351813 medRxiv

Top 0.4%

0.8%

Melanocytes can become cancerous, forming tumors that may invade and destroy the surrounding tissues. When melanocytes acquire invasive characteristics, the anchored melanoma begins to damage normal cells. Therefore, early intervention and diagnosis are essential to avoid high morbidity and mortality in malignant melanoma. However, it is challenging to distinguish between malignant melanoma and a benign clump of melanocytes. Based on a data set of 10,000 melanocyte tumors, this paper develops a new model system to improve the accuracy of distinguishing between benign and malignant melanocytes. In the first stage, standard CNN architectures such as ResNet18, ResNet50, VGG11, and VGG16 are used. Synthetic medical images, generated via a Diffusion Model to extract informative features from the original dataset, are used to train the CNN architectures. This approach improves classification accuracy from 91.1% to 92.9%. In the second stage, the fully connected layer of each neural network is replaced with a high-level classifier, XGBoost, to perform secondary classification. This hybrid strategy further enhances performance, achieving up to 93.3% accuracy by using the synthetic images.

19

MAE-UNETR++: Masked Autoencoder Pretraining for 3-D Lung Nodule Segmentation

Savant, V.; Wang, Y.; Xuan, J.

2026-06-19 bioengineering 10.64898/2026.06.17.733000 medRxiv

Top 0.4%

0.8%

Voxel-level annotation for volumetric medical imaging is expensive and difficult to scale, which makes training high-capacity 3-D segmentation models challenging in practice. Transfer learning (TL) from large public datasets is a common remedy, but it can under-perform when the source domain differs from the target anatomy and acquisition characteristics, as is often the case for pulmonary nodules. In this work, we propose a masked autoencoder (MAE) pretraining-based approach to break the data efficiency wall of domain difference and present a focused empirical study of domain-specific self-supervised learning (SSL) for 3-D lung nodule segmentation. We evaluate two experimental settings: first, MAE pretraining versus random initialization across representative baselines; second, MAE versus Decathlon TL for UNETR++ while testing whether MAE-based pretraining also benefits a CNN baseline (V-Net). MAE pretraining on target-domain CT volumes achieves a Dice Similarity Coefficient (DSC) of 0.307, outperforming random initialization (0.136) and Decathlon weights (0.257). In addition, MAE improves the stability of V-Net in a "low-data" regime (i.e., with "insufficiently labeled" data), increasing DSC from 0.010 to 0.071. Overall, these results suggest that MAE-based pretraining can provide a practical and robust initialization strategy for volumetric segmentation when labeled data are limited.
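
For readers unfamiliar with the metric, the Dice Similarity Coefficient for binary masks is DSC = 2|A ∩ B| / (|A| + |B|). A minimal sketch on flattened masks follows; the function name, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions, not details from the paper.

```python
def dice_coefficient(pred, target):
    """Dice Similarity Coefficient for two flattened binary masks of equal length."""
    pred, target = list(pred), list(target)
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total

# Toy example: 2 overlapping voxels, 3 predicted, 3 in the reference mask.
pred_mask = [1, 1, 0, 0, 1, 0]
true_mask = [1, 0, 0, 0, 1, 1]
dsc = dice_coefficient(pred_mask, true_mask)  # 2 * 2 / (3 + 3)
print(f"DSC = {dsc:.3f}")
```

Because small structures such as nodules occupy few voxels, a handful of misplaced voxels moves the score considerably, which helps put the low absolute values above in context.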

20

Failure detection in medical image classification under realistic distribution shifts: A large-scale benchmark

Steinmetz, P.; Frouin, F.; Morard, V.; Buvat, I.

2026-05-05 radiology and imaging 10.64898/2026.05.04.26350496 medRxiv

Top 0.4%

0.8%

Medical images (MI) exhibit variability due to different acquisition protocols, devices, and patient populations, making failure detection at inference time essential for reliable deployment of clinical classifiers. As existing evaluations of failure detection methods use different settings, it is difficult to compare results and identify the best strategy, if any. We present a comprehensive benchmark of eight confidence scoring functions and two score-aggregation strategies across eight MI tasks spanning diverse modalities, backbone architectures, training setups, and failure sources. The confidence ranking ability and classification error mitigation are jointly evaluated. While no single method systematically dominated across settings, aggregation of confidence scores consistently matched or approached the best individual method and substantially reduced silent failure rate. The failure detection performance was strongly correlated with classifier accuracy for all tested settings. These findings provide large-scale evidence regarding the strengths and limitations of confidence scoring strategies and offer actionable guidance for mitigating silent failures under realistic distribution shifts in MI.
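
The abstract does not list the eight confidence scoring functions or the two aggregation strategies. As a rough illustration of the general idea only, the sketch below uses two widely used scores (maximum softmax probability and negative entropy) and combines them by averaging normalized ranks. All names, the choice of scores, the aggregation rule and the probabilities are assumptions.

```python
from math import log

def max_softmax_probability(probs):
    """Confidence as the largest predicted class probability."""
    return max(probs)

def negative_entropy(probs):
    """Confidence as minus the Shannon entropy (higher means more confident)."""
    return sum(p * log(p) for p in probs if p > 0)

def average_ranks(score_lists):
    """Aggregate several confidence scores by averaging per-case ranks scaled to [0, 1].

    Ties are broken by case order, which is a simplification.
    """
    n = len(score_lists[0])
    if n < 2:
        raise ValueError("need at least two cases to rank")
    aggregated = [0.0] * n
    for scores in score_lists:
        order = sorted(range(n), key=lambda i: scores[i])
        for rank, i in enumerate(order):
            aggregated[i] += rank / (n - 1) / len(score_lists)
    return aggregated

# Hypothetical softmax outputs for three cases (placeholders).
softmax_outputs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.60, 0.30, 0.10],
]
msp = [max_softmax_probability(p) for p in softmax_outputs]
neg_ent = [negative_entropy(p) for p in softmax_outputs]
combined = average_ranks([msp, neg_ent])
print(combined)
```

Cases with a low combined score would be the ones flagged for review; a rank-based combination avoids having to put scores with different ranges on a common scale.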